Customer Churn Analysis¶

This is a dataset of an ecommerce company and we have some customers who are churning (leaving).

Goals¶

  1. Perform exploratory analysis of the provided customer data to share insights of the behavior and characteristics of the customers. Make suggestions to help the company with customer retention

  2. Build a predictive model to identify customers who are at risk of leaving the company (churn) based on the provided variables. This can help the company take proactive steps to retain these customers and reduce the rate of churn

Data description¶

  1. CustomerID

  2. Churn: Churn Flag

  3. Tenure: in months

  4. PreferredLoginDevice

  5. CityTier

  6. WarehouseToHome: Distance in between warehouse to home of customer

  7. PreferredPaymentMode

  8. Gender

  9. HourSpendOnApp

  10. NumberOfDeviceRegistered

  11. PreferedOrderCat

  12. SatisfactionScore

  13. MaritalStatus

  14. NumberOfAddress

  15. OrderAmountHikeFromlastYear: Percentage increases in order from last year

  16. CouponUsed: Total number of coupon has been used in last month

  17. OrderCount: Total number of orders has been places in last month

  18. DaySinceLastOrder

  19. CashbackAmount: Average cashback in last month

  20. Complain: Complain flag - if the customer ever had a complain

Tips¶

  1. Sharing your thoughts and reasoning as you go will help!
  2. Feel free to use any libraries like scikit-learn, use stackoverflow. Don't use ChatGPT and similar.
  3. If you are unable to complete any step I can provide help towards the answer. Demonstrating understanding of the solution and result will earn some points!

Import the libraries¶

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

1. Data overview¶

Question: Read the sheet named 'E Comm' from file 'E Commerce Dataset.xlsx' saved in current directory into df variable. Print first 5 rows of the dataframe¶

In [2]:
df = pd.read_excel('E Commerce Dataset.xlsx')
df.head(5)
Out[2]:
CustomerID Churn Tenure PreferredLoginDevice CityTier WarehouseToHome PreferredPaymentMode Gender HourSpendOnApp NumberOfDeviceRegistered PreferedOrderCat SatisfactionScore MaritalStatus NumberOfAddress Complain OrderAmountHikeFromlastYear CouponUsed OrderCount DaySinceLastOrder CashbackAmount
0 50001 1 4.0 Mobile Phone 3 6.0 Debit Card Female 3.0 3 Laptop & Accessory 2 Single 9 1 11.0 1.0 1.0 5.0 159.93
1 50002 1 NaN Phone 1 8.0 UPI Male 3.0 4 Mobile 3 Single 7 1 15.0 0.0 1.0 0.0 120.90
2 50003 1 NaN Phone 1 30.0 Debit Card Male 2.0 4 Mobile 3 Single 6 1 14.0 0.0 1.0 3.0 120.28
3 50004 1 0.0 Phone 3 15.0 Debit Card Male 2.0 4 Laptop & Accessory 5 Single 8 0 23.0 0.0 1.0 3.0 134.07
4 50005 1 0.0 Phone 1 12.0 CC Male NaN 3 Mobile 5 Single 3 0 11.0 1.0 1.0 3.0 129.60
In [3]:
df.shape
Out[3]:
(5630, 20)
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5630 entries, 0 to 5629
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   CustomerID                   5630 non-null   int64  
 1   Churn                        5630 non-null   int64  
 2   Tenure                       5366 non-null   float64
 3   PreferredLoginDevice         5630 non-null   object 
 4   CityTier                     5630 non-null   int64  
 5   WarehouseToHome              5379 non-null   float64
 6   PreferredPaymentMode         5630 non-null   object 
 7   Gender                       5630 non-null   object 
 8   HourSpendOnApp               5375 non-null   float64
 9   NumberOfDeviceRegistered     5630 non-null   int64  
 10  PreferedOrderCat             5630 non-null   object 
 11  SatisfactionScore            5630 non-null   int64  
 12  MaritalStatus                5630 non-null   object 
 13  NumberOfAddress              5630 non-null   int64  
 14  Complain                     5630 non-null   int64  
 15  OrderAmountHikeFromlastYear  5365 non-null   float64
 16  CouponUsed                   5374 non-null   float64
 17  OrderCount                   5372 non-null   float64
 18  DaySinceLastOrder            5323 non-null   float64
 19  CashbackAmount               5630 non-null   float64
dtypes: float64(8), int64(7), object(5)
memory usage: 879.8+ KB

Question: How many unique values are in each column?¶

In [5]:
for columns in df:
  print(columns,df[columns].nunique())
CustomerID 5630
Churn 2
Tenure 36
PreferredLoginDevice 3
CityTier 3
WarehouseToHome 34
PreferredPaymentMode 7
Gender 2
HourSpendOnApp 6
NumberOfDeviceRegistered 6
PreferedOrderCat 6
SatisfactionScore 5
MaritalStatus 3
NumberOfAddress 15
Complain 2
OrderAmountHikeFromlastYear 16
CouponUsed 17
OrderCount 16
DaySinceLastOrder 22
CashbackAmount 2586

Question: Calculate average churn rate¶

Note: Churn = 1 means customer has churned

In [6]:
df[df['Churn'] == 1].count()
Out[6]:
CustomerID                     948
Churn                          948
Tenure                         867
PreferredLoginDevice           948
CityTier                       948
WarehouseToHome                864
PreferredPaymentMode           948
Gender                         948
HourSpendOnApp                 890
NumberOfDeviceRegistered       948
PreferedOrderCat               948
SatisfactionScore              948
MaritalStatus                  948
NumberOfAddress                948
Complain                       948
OrderAmountHikeFromlastYear    934
CouponUsed                     940
OrderCount                     930
DaySinceLastOrder              894
CashbackAmount                 948
dtype: int64
In [7]:
#check if there is any n/a
df['Churn'].isna().nunique()
Out[7]:
1
In [8]:
#churn rate
df[df['Churn'] == 1].shape[0]/df.shape[0]*100
Out[8]:
16.838365896980463

Question: How many missing values / nulls are there in each column?¶

In [9]:
for col in df.columns:
  print(col, df[col].isna().sum())
CustomerID 0
Churn 0
Tenure 264
PreferredLoginDevice 0
CityTier 0
WarehouseToHome 251
PreferredPaymentMode 0
Gender 0
HourSpendOnApp 255
NumberOfDeviceRegistered 0
PreferedOrderCat 0
SatisfactionScore 0
MaritalStatus 0
NumberOfAddress 0
Complain 0
OrderAmountHikeFromlastYear 265
CouponUsed 256
OrderCount 258
DaySinceLastOrder 307
CashbackAmount 0

2. Exploratory Data Analysis¶

Univariate analysis¶

Here we will understand select variables

1. Numeric variables¶

Question: Histogram¶

Show histograms for all numeric columns. Describe the each variable's distribution briefly to a business stakeholder

In [10]:
numerical_df = df[['CustomerID','Churn','Tenure','CityTier','WarehouseToHome','HourSpendOnApp','NumberOfDeviceRegistered','SatisfactionScore','NumberOfAddress','Complain', 'OrderAmountHikeFromlastYear',	'CouponUsed',	'OrderCount',	'DaySinceLastOrder',	'CashbackAmount']]
numerical_df.head(5)
Out[10]:
CustomerID Churn Tenure CityTier WarehouseToHome HourSpendOnApp NumberOfDeviceRegistered SatisfactionScore NumberOfAddress Complain OrderAmountHikeFromlastYear CouponUsed OrderCount DaySinceLastOrder CashbackAmount
0 50001 1 4.0 3 6.0 3.0 3 2 9 1 11.0 1.0 1.0 5.0 159.93
1 50002 1 NaN 1 8.0 3.0 4 3 7 1 15.0 0.0 1.0 0.0 120.90
2 50003 1 NaN 1 30.0 2.0 4 3 6 1 14.0 0.0 1.0 3.0 120.28
3 50004 1 0.0 3 15.0 2.0 4 5 8 0 23.0 0.0 1.0 3.0 134.07
4 50005 1 0.0 1 12.0 NaN 3 5 3 0 11.0 1.0 1.0 3.0 129.60
In [11]:
categorical_df = df[['PreferredLoginDevice','PreferredPaymentMode','Gender','PreferedOrderCat','MaritalStatus']]
categorical_df.head(5)
Out[11]:
PreferredLoginDevice PreferredPaymentMode Gender PreferedOrderCat MaritalStatus
0 Mobile Phone Debit Card Female Laptop & Accessory Single
1 Phone UPI Male Mobile Single
2 Phone Debit Card Male Mobile Single
3 Phone Debit Card Male Laptop & Accessory Single
4 Phone CC Male Mobile Single
In [12]:
sns.histplot(data=numerical_df, x="CityTier")
Out[12]:
<AxesSubplot:xlabel='CityTier', ylabel='Count'>
In [13]:
sns.pairplot(numerical_df,hue="Churn")
Out[13]:
<seaborn.axisgrid.PairGrid at 0x7fd73b7e36a0>

2. Non numeric columns¶

Cleaning (ignore)

In [14]:
#As mobile phone and phone are both same so we have merged them
df.loc[df['PreferredLoginDevice'] == 'Phone', 'PreferredLoginDevice' ] = 'Mobile Phone'
df.loc[df['PreferedOrderCat'] == 'Mobile', 'PreferedOrderCat' ] = 'Mobile Phone'
#as cod is also cash on delievery
#as cc is also credit card so i merged them
df.loc[df['PreferredPaymentMode'] == 'COD', 'PreferredPaymentMode' ] = 'Cash on Delivery'   # uses loc function
df.loc[df['PreferredPaymentMode'] == 'CC', 'PreferredPaymentMode' ] = 'Credit Card'

Question: Show the unique values of each variable and print the number of occurences of each value¶

Describe these results briefly

In [15]:
for col in df.columns:
    if df[col].dtype == object:
      print(df[col].unique(), df[col].value_counts())
['Mobile Phone' 'Computer'] Mobile Phone    3996
Computer        1634
Name: PreferredLoginDevice, dtype: int64
['Debit Card' 'UPI' 'Credit Card' 'Cash on Delivery' 'E wallet'] Debit Card          2314
Credit Card         1774
E wallet             614
Cash on Delivery     514
UPI                  414
Name: PreferredPaymentMode, dtype: int64
['Female' 'Male'] Male      3384
Female    2246
Name: Gender, dtype: int64
['Laptop & Accessory' 'Mobile Phone' 'Others' 'Fashion' 'Grocery'] Mobile Phone          2080
Laptop & Accessory    2050
Fashion                826
Grocery                410
Others                 264
Name: PreferedOrderCat, dtype: int64
['Single' 'Divorced' 'Married'] Married     2986
Single      1796
Divorced     848
Name: MaritalStatus, dtype: int64

Analysing the Churn by select variables¶

Provide business recommendation for each of the below

Question: Relation between complains and churn¶

In [16]:
df['Churn'].corr(df['Complain'])
Out[16]:
0.25018825469703104
In [20]:
churn_rate_complain = df[df['Complain'] == 1]['Churn'].mean()
churn_rate_no_complain = df[df['Complain'] == 0]['Churn'].mean()
In [21]:
print("Churn rate for Complain =", churn_rate_complain)
print("Churn rate for No Complain =",churn_rate_no_complain)
Churn rate for Complain = 0.3167082294264339
Churn rate for No Complain = 0.1092896174863388

Correlation matrix¶

Visualize the correlation between all variables

Question: Do we need to do any preprocessing on categorical variables before calculating correlation?¶

In [22]:
#Onc hot coding for categorical dataset
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
new_df = enc.fit_transform(categorical_df)
df_ohc = pd.DataFrame(new_df.toarray(), columns=enc.get_feature_names_out(), dtype=int)
In [23]:
df_ohc.head(5)
Out[23]:
PreferredLoginDevice_Computer PreferredLoginDevice_Mobile Phone PreferredLoginDevice_Phone PreferredPaymentMode_CC PreferredPaymentMode_COD PreferredPaymentMode_Cash on Delivery PreferredPaymentMode_Credit Card PreferredPaymentMode_Debit Card PreferredPaymentMode_E wallet PreferredPaymentMode_UPI ... Gender_Male PreferedOrderCat_Fashion PreferedOrderCat_Grocery PreferedOrderCat_Laptop & Accessory PreferedOrderCat_Mobile PreferedOrderCat_Mobile Phone PreferedOrderCat_Others MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single
0 0 1 0 0 0 0 0 1 0 0 ... 0 0 0 1 0 0 0 0 0 1
1 0 0 1 0 0 0 0 0 0 1 ... 1 0 0 0 1 0 0 0 0 1
2 0 0 1 0 0 0 0 1 0 0 ... 1 0 0 0 1 0 0 0 0 1
3 0 0 1 0 0 0 0 1 0 0 ... 1 0 0 1 0 0 0 0 0 1
4 0 0 1 1 0 0 0 0 0 0 ... 1 0 0 0 1 0 0 0 0 1

5 rows × 21 columns

In [24]:
df_final = pd.concat([numerical_df, df_ohc], axis=1)
df_final.head(5)
Out[24]:
CustomerID Churn Tenure CityTier WarehouseToHome HourSpendOnApp NumberOfDeviceRegistered SatisfactionScore NumberOfAddress Complain ... Gender_Male PreferedOrderCat_Fashion PreferedOrderCat_Grocery PreferedOrderCat_Laptop & Accessory PreferedOrderCat_Mobile PreferedOrderCat_Mobile Phone PreferedOrderCat_Others MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single
0 50001 1 4.0 3 6.0 3.0 3 2 9 1 ... 0 0 0 1 0 0 0 0 0 1
1 50002 1 NaN 1 8.0 3.0 4 3 7 1 ... 1 0 0 0 1 0 0 0 0 1
2 50003 1 NaN 1 30.0 2.0 4 3 6 1 ... 1 0 0 0 1 0 0 0 0 1
3 50004 1 0.0 3 15.0 2.0 4 5 8 0 ... 1 0 0 1 0 0 0 0 0 1
4 50005 1 0.0 1 12.0 NaN 3 5 3 0 ... 1 0 0 0 1 0 0 0 0 1

5 rows × 36 columns

In [25]:
df_final = df_final.dropna()
df_final.head(5)
Out[25]:
CustomerID Churn Tenure CityTier WarehouseToHome HourSpendOnApp NumberOfDeviceRegistered SatisfactionScore NumberOfAddress Complain ... Gender_Male PreferedOrderCat_Fashion PreferedOrderCat_Grocery PreferedOrderCat_Laptop & Accessory PreferedOrderCat_Mobile PreferedOrderCat_Mobile Phone PreferedOrderCat_Others MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single
0 50001 1 4.0 3 6.0 3.0 3 2 9 1 ... 0 0 0 1 0 0 0 0 0 1
3 50004 1 0.0 3 15.0 2.0 4 5 8 0 ... 1 0 0 1 0 0 0 0 0 1
5 50006 1 0.0 1 22.0 3.0 5 5 2 1 ... 0 0 0 0 0 1 0 0 0 1
11 50012 1 11.0 1 6.0 3.0 4 3 10 1 ... 1 1 0 0 0 0 0 0 0 1
12 50013 1 0.0 1 11.0 2.0 3 3 2 1 ... 1 0 0 0 1 0 0 0 0 1

5 rows × 36 columns

Question: Plot correlation matrix¶

Discuss a few significant correlations

In [26]:
#CustomerID column as it's not relevant for correlation
df_final = df_final.drop(columns=['CustomerID'])
df_final.head(5)
Out[26]:
Churn Tenure CityTier WarehouseToHome HourSpendOnApp NumberOfDeviceRegistered SatisfactionScore NumberOfAddress Complain OrderAmountHikeFromlastYear ... Gender_Male PreferedOrderCat_Fashion PreferedOrderCat_Grocery PreferedOrderCat_Laptop & Accessory PreferedOrderCat_Mobile PreferedOrderCat_Mobile Phone PreferedOrderCat_Others MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single
0 1 4.0 3 6.0 3.0 3 2 9 1 11.0 ... 0 0 0 1 0 0 0 0 0 1
3 1 0.0 3 15.0 2.0 4 5 8 0 23.0 ... 1 0 0 1 0 0 0 0 0 1
5 1 0.0 1 22.0 3.0 5 5 2 1 22.0 ... 0 0 0 0 0 1 0 0 0 1
11 1 11.0 1 6.0 3.0 4 3 10 1 13.0 ... 1 1 0 0 0 0 0 0 0 1
12 1 0.0 1 11.0 2.0 3 3 2 1 13.0 ... 1 0 0 0 1 0 0 0 0 1

5 rows × 35 columns

In [28]:
corr_matrix = df_final.corr()
corr_matrix
Out[28]:
Churn Tenure CityTier WarehouseToHome HourSpendOnApp NumberOfDeviceRegistered SatisfactionScore NumberOfAddress Complain OrderAmountHikeFromlastYear ... Gender_Male PreferedOrderCat_Fashion PreferedOrderCat_Grocery PreferedOrderCat_Laptop & Accessory PreferedOrderCat_Mobile PreferedOrderCat_Mobile Phone PreferedOrderCat_Others MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single
Churn 1.000000 -0.340013 0.073858 0.087318 0.060845 0.149041 0.095759 0.076336 0.238137 0.017193 ... 0.033792 0.013086 -0.017880 -0.184584 0.016675 0.181683 0.010208 -0.013023 -0.159808 0.179481
Tenure -0.340013 1.000000 -0.057414 -0.016353 -0.029818 -0.058752 -0.009972 0.196547 -0.035228 0.012197 ... -0.052123 0.130259 0.091251 0.104482 -0.033466 -0.211477 0.103702 0.015788 0.085990 -0.103150
CityTier 0.073858 -0.057414 1.000000 0.004457 -0.070035 -0.008616 -0.019494 -0.046406 -0.006122 -0.027628 ... -0.018072 0.083035 -0.001750 0.245824 -0.097493 -0.281079 -0.011248 0.019181 -0.029876 0.017370
WarehouseToHome 0.087318 -0.016353 0.004457 1.000000 0.052731 0.024582 0.000434 0.003422 0.003829 0.031975 ... 0.035581 -0.009435 0.003512 0.043231 -0.015441 -0.028939 -0.035530 -0.013650 0.038336 -0.030496
HourSpendOnApp 0.060845 -0.029818 -0.070035 0.052731 1.000000 0.293021 0.039879 0.124962 0.020413 0.096827 ... -0.009952 -0.147830 -0.017382 -0.130819 -0.112896 0.281309 0.023109 0.018024 0.022257 -0.037134
NumberOfDeviceRegistered 0.149041 -0.058752 -0.008616 0.024582 0.293021 1.000000 -0.017788 0.066684 0.018881 0.083342 ... -0.026706 -0.091604 0.022548 -0.080608 -0.125343 0.188563 0.035364 -0.001099 -0.064517 0.069345
SatisfactionScore 0.095759 -0.009972 -0.019494 0.000434 0.039879 -0.017788 1.000000 0.054627 -0.044533 -0.008143 ... -0.032398 0.016587 0.027097 -0.025448 -0.006238 0.011297 0.030401 0.051528 -0.012167 -0.025659
NumberOfAddress 0.076336 0.196547 -0.046406 0.003422 0.124962 0.066684 0.054627 1.000000 -0.017363 0.010275 ... -0.030980 -0.040136 -0.005968 0.059515 -0.031229 -0.023102 -0.004349 -0.024802 -0.001463 0.020125
Complain 0.238137 -0.035228 -0.006122 0.003829 0.020413 0.018881 -0.044533 -0.017363 1.000000 0.003260 ... -0.039522 -0.000022 0.004534 -0.018126 -0.005327 0.019793 0.007866 0.012777 -0.000364 -0.009180
OrderAmountHikeFromlastYear 0.017193 0.012197 -0.027628 0.031975 0.096827 0.083342 -0.008143 0.010275 0.003260 1.000000 ... 0.010081 -0.051335 0.041476 -0.024964 -0.067089 0.080301 0.021090 -0.014517 0.014894 -0.004949
CouponUsed 0.010982 0.075959 0.013313 -0.014311 0.160239 0.114066 0.006687 -0.017851 0.005472 0.048959 ... -0.041493 0.078762 0.078182 -0.038409 -0.088426 0.001006 0.082800 0.043312 0.010976 -0.044087
OrderCount -0.001962 0.112339 0.035018 0.008800 0.092601 0.069559 -0.004622 -0.079413 -0.001917 0.042950 ... -0.035418 0.154599 0.126821 -0.018610 -0.082487 -0.083855 0.118690 0.022122 0.043676 -0.062952
DaySinceLastOrder -0.139254 0.118932 0.014304 0.010646 0.019526 -0.049817 0.012950 -0.148784 -0.056315 0.017716 ... -0.018478 0.157977 0.035204 0.122257 -0.082597 -0.221627 0.071340 -0.007111 0.034002 -0.030789
CashbackAmount -0.058756 0.213852 0.146569 -0.012433 0.131281 0.113504 0.012061 0.096752 0.010389 0.014215 ... -0.020263 0.455466 0.014736 0.101787 -0.232654 -0.401303 0.443994 0.013801 0.034234 -0.046693
PreferredLoginDevice_Computer 0.040890 -0.010308 -0.000184 0.022101 -0.004911 0.025628 -0.035021 -0.024398 -0.014878 -0.000220 ... 0.025482 -0.054930 0.032595 0.056688 -0.036701 -0.012669 0.005916 -0.016560 -0.006364 0.019158
PreferredLoginDevice_Mobile Phone -0.079120 0.085754 0.098949 -0.002403 0.106765 0.042948 -0.007731 0.078405 0.012571 0.004871 ... -0.037886 0.188988 -0.014342 0.093415 -0.185187 -0.162324 0.021285 0.005117 0.030007 -0.035702
PreferredLoginDevice_Phone 0.053023 -0.096773 -0.125199 -0.022497 -0.129642 -0.084051 0.050272 -0.071176 0.001261 -0.005919 ... 0.018568 -0.176046 -0.019492 -0.183909 0.277125 0.220374 -0.033815 0.012652 -0.030677 0.023109
PreferredPaymentMode_CC 0.030729 -0.032474 -0.073109 0.007388 -0.054905 -0.036118 0.013975 0.018801 0.019183 -0.028536 ... 0.010948 -0.035283 -0.003861 -0.089557 0.504562 -0.067153 -0.006698 0.015132 -0.018716 0.008548
PreferredPaymentMode_COD 0.059430 -0.027497 -0.049111 -0.004197 -0.004520 0.002875 0.011974 -0.009988 -0.017256 -0.059288 ... 0.027718 -0.064820 -0.011748 -0.053643 0.053221 0.085898 -0.020380 0.006594 -0.029529 0.026426
PreferredPaymentMode_Cash on Delivery -0.006498 0.069208 -0.004977 -0.007623 0.016056 -0.011988 0.005600 -0.004053 -0.013373 0.000700 ... -0.009100 0.193691 0.054822 -0.065981 -0.020480 -0.068682 0.060785 0.007005 0.017955 -0.024315
PreferredPaymentMode_Credit Card -0.029392 -0.024243 -0.150637 0.036141 0.033820 0.000782 0.043184 0.003998 0.003623 0.036106 ... -0.028839 -0.039491 0.032185 -0.033678 -0.094304 0.098421 -0.019855 -0.001500 0.003140 -0.002212
PreferredPaymentMode_Debit Card -0.039232 0.040958 -0.127528 -0.041475 0.020115 0.007461 -0.073269 0.013272 -0.002413 0.010242 ... 0.004968 -0.004245 -0.019560 -0.001249 0.026243 -0.008097 0.028682 -0.010594 -0.011571 0.020221
PreferredPaymentMode_E wallet 0.043972 0.016709 0.503165 0.033578 -0.039522 -0.008889 0.026658 -0.027925 0.001807 -0.036139 ... -0.022827 0.074168 -0.014552 0.141394 -0.065803 -0.175791 -0.001349 0.020559 0.017060 -0.033513
PreferredPaymentMode_UPI 0.000938 -0.043942 -0.036088 -0.021470 -0.030052 0.010839 0.009321 0.007556 0.012421 0.033331 ... 0.039949 -0.016992 -0.011405 0.003837 -0.034355 0.024280 -0.019785 -0.017968 0.024753 -0.012836
Gender_Female -0.033792 0.052123 0.018072 -0.035581 0.009952 0.026706 0.032398 0.030980 0.039522 -0.010081 ... -1.000000 0.009374 -0.005292 0.043367 0.001884 -0.053910 0.006532 0.012541 -0.035047 0.027833
Gender_Male 0.033792 -0.052123 -0.018072 0.035581 -0.009952 -0.026706 -0.032398 -0.030980 -0.039522 0.010081 ... 1.000000 -0.009374 0.005292 -0.043367 -0.001884 0.053910 -0.006532 -0.012541 0.035047 -0.027833
PreferedOrderCat_Fashion 0.013086 0.130259 0.083035 -0.009435 -0.147830 -0.091604 0.016587 -0.040136 -0.000022 -0.051335 ... -0.009374 1.000000 -0.014552 -0.379275 -0.065803 -0.253118 -0.025246 0.004190 0.010466 -0.014253
PreferedOrderCat_Grocery -0.017880 0.091251 -0.001750 0.003512 -0.017382 0.022548 0.027097 -0.005968 0.004534 0.041476 ... 0.005292 -0.014552 1.000000 -0.041501 -0.007200 -0.027697 -0.002762 -0.016429 0.024625 -0.013853
PreferedOrderCat_Laptop & Accessory -0.184584 0.104482 0.245824 0.043231 -0.130819 -0.080608 -0.025448 0.059515 -0.018126 -0.024964 ... -0.043367 -0.379275 -0.041501 1.000000 -0.187659 -0.721851 -0.071997 -0.000340 0.057496 -0.060812
PreferedOrderCat_Mobile 0.016675 -0.033466 -0.097493 -0.015441 -0.112896 -0.125343 -0.006238 -0.031229 -0.005327 -0.067089 ... -0.001884 -0.065803 -0.007200 -0.187659 1.000000 -0.125238 -0.012491 -0.009681 -0.028832 0.037872
PreferedOrderCat_Mobile Phone 0.181683 -0.211477 -0.281079 -0.028939 0.281309 0.188563 0.011297 -0.023102 0.019793 0.080301 ... 0.053910 -0.253118 -0.027697 -0.721851 -0.125238 1.000000 -0.048049 0.001864 -0.062740 0.065241
PreferedOrderCat_Others 0.010208 0.103702 -0.011248 -0.035530 0.023109 0.035364 0.030401 -0.004349 0.007866 0.021090 ... -0.006532 -0.025246 -0.002762 -0.071997 -0.012491 -0.048049 1.000000 0.004273 0.019615 -0.024032
MaritalStatus_Divorced -0.013023 0.015788 0.019181 -0.013650 0.018024 -0.001099 0.051528 -0.024802 0.012777 -0.014517 ... -0.012541 0.004190 -0.016429 -0.000340 -0.009681 0.001864 0.004273 1.000000 -0.432989 -0.288871
MaritalStatus_Married -0.159808 0.085990 -0.029876 0.038336 0.022257 -0.064517 -0.012167 -0.001463 -0.000364 0.014894 ... 0.035047 0.010466 0.024625 0.057496 -0.028832 -0.062740 0.019615 -0.432989 1.000000 -0.737892
MaritalStatus_Single 0.179481 -0.103150 0.017370 -0.030496 -0.037134 0.069345 -0.025659 0.020125 -0.009180 -0.004949 ... -0.027833 -0.014253 -0.013853 -0.060812 0.037872 0.065241 -0.024032 -0.288871 -0.737892 1.000000

35 rows × 35 columns

In [31]:
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
figure(figsize=(10, 8), dpi=150)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.8)
plt.show()

Optional: What is the correlation of each feature with target¶

Sort the correlation in descending order

In [35]:
corr_matrix.sort_values(by = 'Churn', ascending=False)
Out[35]:
Churn Tenure CityTier WarehouseToHome HourSpendOnApp NumberOfDeviceRegistered SatisfactionScore NumberOfAddress Complain OrderAmountHikeFromlastYear ... Gender_Male PreferedOrderCat_Fashion PreferedOrderCat_Grocery PreferedOrderCat_Laptop & Accessory PreferedOrderCat_Mobile PreferedOrderCat_Mobile Phone PreferedOrderCat_Others MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single
Churn 1.000000 -0.340013 0.073858 0.087318 0.060845 0.149041 0.095759 0.076336 0.238137 0.017193 ... 0.033792 0.013086 -0.017880 -0.184584 0.016675 0.181683 0.010208 -0.013023 -0.159808 0.179481
Complain 0.238137 -0.035228 -0.006122 0.003829 0.020413 0.018881 -0.044533 -0.017363 1.000000 0.003260 ... -0.039522 -0.000022 0.004534 -0.018126 -0.005327 0.019793 0.007866 0.012777 -0.000364 -0.009180
PreferedOrderCat_Mobile Phone 0.181683 -0.211477 -0.281079 -0.028939 0.281309 0.188563 0.011297 -0.023102 0.019793 0.080301 ... 0.053910 -0.253118 -0.027697 -0.721851 -0.125238 1.000000 -0.048049 0.001864 -0.062740 0.065241
MaritalStatus_Single 0.179481 -0.103150 0.017370 -0.030496 -0.037134 0.069345 -0.025659 0.020125 -0.009180 -0.004949 ... -0.027833 -0.014253 -0.013853 -0.060812 0.037872 0.065241 -0.024032 -0.288871 -0.737892 1.000000
NumberOfDeviceRegistered 0.149041 -0.058752 -0.008616 0.024582 0.293021 1.000000 -0.017788 0.066684 0.018881 0.083342 ... -0.026706 -0.091604 0.022548 -0.080608 -0.125343 0.188563 0.035364 -0.001099 -0.064517 0.069345
SatisfactionScore 0.095759 -0.009972 -0.019494 0.000434 0.039879 -0.017788 1.000000 0.054627 -0.044533 -0.008143 ... -0.032398 0.016587 0.027097 -0.025448 -0.006238 0.011297 0.030401 0.051528 -0.012167 -0.025659
WarehouseToHome 0.087318 -0.016353 0.004457 1.000000 0.052731 0.024582 0.000434 0.003422 0.003829 0.031975 ... 0.035581 -0.009435 0.003512 0.043231 -0.015441 -0.028939 -0.035530 -0.013650 0.038336 -0.030496
NumberOfAddress 0.076336 0.196547 -0.046406 0.003422 0.124962 0.066684 0.054627 1.000000 -0.017363 0.010275 ... -0.030980 -0.040136 -0.005968 0.059515 -0.031229 -0.023102 -0.004349 -0.024802 -0.001463 0.020125
CityTier 0.073858 -0.057414 1.000000 0.004457 -0.070035 -0.008616 -0.019494 -0.046406 -0.006122 -0.027628 ... -0.018072 0.083035 -0.001750 0.245824 -0.097493 -0.281079 -0.011248 0.019181 -0.029876 0.017370
HourSpendOnApp 0.060845 -0.029818 -0.070035 0.052731 1.000000 0.293021 0.039879 0.124962 0.020413 0.096827 ... -0.009952 -0.147830 -0.017382 -0.130819 -0.112896 0.281309 0.023109 0.018024 0.022257 -0.037134
PreferredPaymentMode_COD 0.059430 -0.027497 -0.049111 -0.004197 -0.004520 0.002875 0.011974 -0.009988 -0.017256 -0.059288 ... 0.027718 -0.064820 -0.011748 -0.053643 0.053221 0.085898 -0.020380 0.006594 -0.029529 0.026426
PreferredLoginDevice_Phone 0.053023 -0.096773 -0.125199 -0.022497 -0.129642 -0.084051 0.050272 -0.071176 0.001261 -0.005919 ... 0.018568 -0.176046 -0.019492 -0.183909 0.277125 0.220374 -0.033815 0.012652 -0.030677 0.023109
PreferredPaymentMode_E wallet 0.043972 0.016709 0.503165 0.033578 -0.039522 -0.008889 0.026658 -0.027925 0.001807 -0.036139 ... -0.022827 0.074168 -0.014552 0.141394 -0.065803 -0.175791 -0.001349 0.020559 0.017060 -0.033513
PreferredLoginDevice_Computer 0.040890 -0.010308 -0.000184 0.022101 -0.004911 0.025628 -0.035021 -0.024398 -0.014878 -0.000220 ... 0.025482 -0.054930 0.032595 0.056688 -0.036701 -0.012669 0.005916 -0.016560 -0.006364 0.019158
Gender_Male 0.033792 -0.052123 -0.018072 0.035581 -0.009952 -0.026706 -0.032398 -0.030980 -0.039522 0.010081 ... 1.000000 -0.009374 0.005292 -0.043367 -0.001884 0.053910 -0.006532 -0.012541 0.035047 -0.027833
PreferredPaymentMode_CC 0.030729 -0.032474 -0.073109 0.007388 -0.054905 -0.036118 0.013975 0.018801 0.019183 -0.028536 ... 0.010948 -0.035283 -0.003861 -0.089557 0.504562 -0.067153 -0.006698 0.015132 -0.018716 0.008548
OrderAmountHikeFromlastYear 0.017193 0.012197 -0.027628 0.031975 0.096827 0.083342 -0.008143 0.010275 0.003260 1.000000 ... 0.010081 -0.051335 0.041476 -0.024964 -0.067089 0.080301 0.021090 -0.014517 0.014894 -0.004949
PreferedOrderCat_Mobile 0.016675 -0.033466 -0.097493 -0.015441 -0.112896 -0.125343 -0.006238 -0.031229 -0.005327 -0.067089 ... -0.001884 -0.065803 -0.007200 -0.187659 1.000000 -0.125238 -0.012491 -0.009681 -0.028832 0.037872
PreferedOrderCat_Fashion 0.013086 0.130259 0.083035 -0.009435 -0.147830 -0.091604 0.016587 -0.040136 -0.000022 -0.051335 ... -0.009374 1.000000 -0.014552 -0.379275 -0.065803 -0.253118 -0.025246 0.004190 0.010466 -0.014253
CouponUsed 0.010982 0.075959 0.013313 -0.014311 0.160239 0.114066 0.006687 -0.017851 0.005472 0.048959 ... -0.041493 0.078762 0.078182 -0.038409 -0.088426 0.001006 0.082800 0.043312 0.010976 -0.044087
PreferedOrderCat_Others 0.010208 0.103702 -0.011248 -0.035530 0.023109 0.035364 0.030401 -0.004349 0.007866 0.021090 ... -0.006532 -0.025246 -0.002762 -0.071997 -0.012491 -0.048049 1.000000 0.004273 0.019615 -0.024032
PreferredPaymentMode_UPI 0.000938 -0.043942 -0.036088 -0.021470 -0.030052 0.010839 0.009321 0.007556 0.012421 0.033331 ... 0.039949 -0.016992 -0.011405 0.003837 -0.034355 0.024280 -0.019785 -0.017968 0.024753 -0.012836
OrderCount -0.001962 0.112339 0.035018 0.008800 0.092601 0.069559 -0.004622 -0.079413 -0.001917 0.042950 ... -0.035418 0.154599 0.126821 -0.018610 -0.082487 -0.083855 0.118690 0.022122 0.043676 -0.062952
PreferredPaymentMode_Cash on Delivery -0.006498 0.069208 -0.004977 -0.007623 0.016056 -0.011988 0.005600 -0.004053 -0.013373 0.000700 ... -0.009100 0.193691 0.054822 -0.065981 -0.020480 -0.068682 0.060785 0.007005 0.017955 -0.024315
MaritalStatus_Divorced -0.013023 0.015788 0.019181 -0.013650 0.018024 -0.001099 0.051528 -0.024802 0.012777 -0.014517 ... -0.012541 0.004190 -0.016429 -0.000340 -0.009681 0.001864 0.004273 1.000000 -0.432989 -0.288871
PreferedOrderCat_Grocery -0.017880 0.091251 -0.001750 0.003512 -0.017382 0.022548 0.027097 -0.005968 0.004534 0.041476 ... 0.005292 -0.014552 1.000000 -0.041501 -0.007200 -0.027697 -0.002762 -0.016429 0.024625 -0.013853
PreferredPaymentMode_Credit Card -0.029392 -0.024243 -0.150637 0.036141 0.033820 0.000782 0.043184 0.003998 0.003623 0.036106 ... -0.028839 -0.039491 0.032185 -0.033678 -0.094304 0.098421 -0.019855 -0.001500 0.003140 -0.002212
Gender_Female -0.033792 0.052123 0.018072 -0.035581 0.009952 0.026706 0.032398 0.030980 0.039522 -0.010081 ... -1.000000 0.009374 -0.005292 0.043367 0.001884 -0.053910 0.006532 0.012541 -0.035047 0.027833
PreferredPaymentMode_Debit Card -0.039232 0.040958 -0.127528 -0.041475 0.020115 0.007461 -0.073269 0.013272 -0.002413 0.010242 ... 0.004968 -0.004245 -0.019560 -0.001249 0.026243 -0.008097 0.028682 -0.010594 -0.011571 0.020221
CashbackAmount -0.058756 0.213852 0.146569 -0.012433 0.131281 0.113504 0.012061 0.096752 0.010389 0.014215 ... -0.020263 0.455466 0.014736 0.101787 -0.232654 -0.401303 0.443994 0.013801 0.034234 -0.046693
PreferredLoginDevice_Mobile Phone -0.079120 0.085754 0.098949 -0.002403 0.106765 0.042948 -0.007731 0.078405 0.012571 0.004871 ... -0.037886 0.188988 -0.014342 0.093415 -0.185187 -0.162324 0.021285 0.005117 0.030007 -0.035702
DaySinceLastOrder -0.139254 0.118932 0.014304 0.010646 0.019526 -0.049817 0.012950 -0.148784 -0.056315 0.017716 ... -0.018478 0.157977 0.035204 0.122257 -0.082597 -0.221627 0.071340 -0.007111 0.034002 -0.030789
MaritalStatus_Married -0.159808 0.085990 -0.029876 0.038336 0.022257 -0.064517 -0.012167 -0.001463 -0.000364 0.014894 ... 0.035047 0.010466 0.024625 0.057496 -0.028832 -0.062740 0.019615 -0.432989 1.000000 -0.737892
PreferedOrderCat_Laptop & Accessory -0.184584 0.104482 0.245824 0.043231 -0.130819 -0.080608 -0.025448 0.059515 -0.018126 -0.024964 ... -0.043367 -0.379275 -0.041501 1.000000 -0.187659 -0.721851 -0.071997 -0.000340 0.057496 -0.060812
Tenure -0.340013 1.000000 -0.057414 -0.016353 -0.029818 -0.058752 -0.009972 0.196547 -0.035228 0.012197 ... -0.052123 0.130259 0.091251 0.104482 -0.033466 -0.211477 0.103702 0.015788 0.085990 -0.103150

35 rows × 35 columns

3. Modelling¶

Prepare data¶

Fill nulls in each column

In [36]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
In [37]:
for col in df.columns:
    nullCount = df[col].isnull().sum()
    if nullCount > 0:
        df[col]=imputer.fit_transform(df[[col]])
        print("Filled", nullCount, "nulls in", col)
Filled 264 nulls in Tenure
Filled 251 nulls in WarehouseToHome
Filled 255 nulls in HourSpendOnApp
Filled 265 nulls in OrderAmountHikeFromlastYear
Filled 256 nulls in CouponUsed
Filled 258 nulls in OrderCount
Filled 307 nulls in DaySinceLastOrder

Question: Make the data suitable for model training¶

In [39]:
df_final.head(5)
Out[39]:
Churn Tenure CityTier WarehouseToHome HourSpendOnApp NumberOfDeviceRegistered SatisfactionScore NumberOfAddress Complain OrderAmountHikeFromlastYear ... Gender_Male PreferedOrderCat_Fashion PreferedOrderCat_Grocery PreferedOrderCat_Laptop & Accessory PreferedOrderCat_Mobile PreferedOrderCat_Mobile Phone PreferedOrderCat_Others MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single
0 1 4.0 3 6.0 3.0 3 2 9 1 11.0 ... 0 0 0 1 0 0 0 0 0 1
3 1 0.0 3 15.0 2.0 4 5 8 0 23.0 ... 1 0 0 1 0 0 0 0 0 1
5 1 0.0 1 22.0 3.0 5 5 2 1 22.0 ... 0 0 0 0 0 1 0 0 0 1
11 1 11.0 1 6.0 3.0 4 3 10 1 13.0 ... 1 1 0 0 0 0 0 0 0 1
12 1 0.0 1 11.0 2.0 3 3 2 1 13.0 ... 1 0 0 0 1 0 0 0 0 1

5 rows × 35 columns

In [40]:
from sklearn.linear_model import LogisticRegression
x = df_final.drop(['Churn'],axis = 1)
x.head(5)
y = df_final['Churn']
In [47]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
In [48]:
#check the shape of X_train and X_test
x_train.shape, x_test.shape
Out[48]:
((3019, 34), (755, 34))
In [49]:
clf = LogisticRegression(random_state=0).fit(x_train,y_train)
/Users/jacquie/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
In [51]:
clf.fit(x_train, y_train)
/Users/jacquie/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[51]:
LogisticRegression(random_state=0)
In [52]:
#predict result
y_pred = clf.predict(x_test)

Model training¶

Question: Train one or more models and show their performance on training and test data¶

Train and show train & test accuracy

In [54]:
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, accuracy_score, confusion_matrix
train_accuracy = accuracy_score(y_pred, y_test)*100
train_accuracy
Out[54]:
89.27152317880794
In [56]:
y_pred_train = clf.predict(x_train)
y_pred_train
Out[56]:
array([0, 0, 1, ..., 0, 1, 0])
In [57]:
print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train)))
Training-set accuracy score: 0.8917

4. Evaluation¶

Feel free to use any libraries / stackoverflow

Question: Show the precision, recall and f1 score for the model with best accuracy¶

In [59]:
print(classification_report(y_pred, y_test, digits=6))
              precision    recall  f1-score   support

           0   0.958333  0.920000  0.938776       675
           1   0.495327  0.662500  0.566845        80

    accuracy                       0.892715       755
   macro avg   0.726830  0.791250  0.752810       755
weighted avg   0.909273  0.892715  0.899366       755

print(classification_report(y_train, y_pred_train, digits=6))

Confusion matrix¶

Question: Show the confusion matrix¶

Describe the performance of model and steps you could take to improve it

Bonus: Plot the confusion matrix¶

In [61]:
print('Confusion matrix:\n', confusion_matrix(y_pred, y_test))
Confusion matrix:
 [[621  54]
 [ 27  53]]
In [62]:
print('Confusion matrix:\n', confusion_matrix(y_pred_train, y_train))
Confusion matrix:
 [[2404  236]
 [  91  288]]

Discussion¶

In [64]:
#check for overfitting and underfitting
print('Training set score: {:.4f}'.format(clf.score(x_train, y_train)))
Training set score: 0.8917
In [65]:
print('Test set score: {:.4f}'.format(clf.score(x_test, y_test)))
Test set score: 0.8927

The training-set accuracy score is 0.8917 while the test-set accuracy to be 0.8927. These two values are quite comparable. So, there is no question of overfitting.

Next Step, we could perform hyperparameter optimization using gridsearch CV to improve the performance for this particular model

In [ ]: